cars data

Scatterplot of cars data


I will work with R’s internal dataset on cars, cars. There are two variables in the dataset, speed [speed] and distance [dist]; this is what they look like.
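For readers who want to see the two variables for themselves, here is a minimal sketch. It assumes only base R, where cars ships in the datasets package; the axis labels and plot title are my own choices.

```r
# cars ships with base R (package datasets): 50 rows, columns speed and dist
data(cars)
str(cars)       # structure: two numeric columns
head(cars)      # first six rows
summary(cars)   # quartiles and mean of each variable

# Scatterplot of the two variables
plot(dist ~ speed, data = cars,
     xlab = "speed", ylab = "dist",
     main = "Scatterplot of cars data")
```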

A hypothesis test requires two things:

  1. A claim about a target (hypothesized) value, in one of three forms:
    • Null equal and alternative two-sided [not equal]: \(\mu = target\) vs. \(\mu \neq target\)
    • Null less than or equal and alternative greater: \(\mu \leq target\) vs. \(\mu > target\)
    • Null greater than or equal and alternative lesser: \(\mu \geq target\) vs. \(\mu < target\)
  2. A confidence level that is a threshold in probability for rendering a true/false answer.

Hypothesis Testing and \(t\)

I will work with the speed variable. On a basic level, a hypothesis is a specific value for the average here. For whatever reason, \(\mu=17\) motivates us. I consider all three distinct hypotheses, though this is not entirely proper: any given application of hypothesis testing should have a clear hypothesis about the variable of interest rather than waffling amongst all of them as I will. I do so only for illustrative purposes. To begin, I need to specify both the hypothesis and the confidence level that I intend to use to evaluate it. Throughout, the hypothesized mean is 17 and the confidence level is 0.9.

Before getting to equations, something is important to note. The sample mean, \(\overline{x}\), the sample standard deviation, \(s\), and the sample size, \(n\), are all known from the data. Only two quantities then remain unknown: \(t\) and \(\mu\).

The \(t\) equation is given by:

\[t=\frac{\overline{x} - \mu}{\frac{s}{\sqrt{n}}}\]

One further algebraic manipulation before starting. Let’s solve for \(\mu\).

\[\mu = \overline{x} - t\left(\frac{s}{\sqrt{n}}\right)\]

These two equations define everything we require, because the data supply \(\overline{x}\), \(s\), and \(n\). Conditional on the data, fixing \(\mu\) at a hypothesized value yields \(t\) from the first equation, and fixing \(t\) at the value implied by a given probability yields \(\mu\) from the second.

Details of the Data


\[t = \frac{15.4 - 17}{\frac{5.288}{\sqrt{50}}} = \frac{-1.6}{0.7478}=-2.14\]

There are 2.14 standard errors between the mean that we obtain from the data and the hypothetical mean of 17; the value from the data is smaller, hence the negative.
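The arithmetic above can be reproduced directly from the data. This is a small sketch in base R; the object names (x_bar, se, mu0, t_stat) are mine and purely illustrative.

```r
x      <- cars$speed
x_bar  <- mean(x)            # sample mean
s      <- sd(x)              # sample standard deviation
n      <- length(x)          # sample size
se     <- s / sqrt(n)        # standard error of the mean
mu0    <- 17                 # hypothesized mean
t_stat <- (x_bar - mu0) / se # the t equation from above

round(c(mean = x_bar, sd = s, se = se, n = n, t = t_stat), 4)
```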

The middle 90% of the simulated means range from 14.18 to 16.58.

Speed from cars: The Distribution of the Average

Mean   SD     SE     N    Percentile of Simulated Means
                          P05    P95    P10    P90
15.4   5.288  0.748  50   14.18  16.58  14.48  16.32
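The text does not show how the simulated means were produced. One plausible reading, and it is only my assumption, is a bootstrap: resample speed with replacement many times and take the mean of each resample. The seed and the number of resamples below are arbitrary choices, so the percentiles should land close to those in the table without matching them exactly.

```r
set.seed(42)      # arbitrary seed, for reproducibility only
n_reps <- 10000   # number of resamples (an assumption, not from the text)

# Mean of each resample of speed, drawn with replacement
sim_means <- replicate(n_reps, mean(sample(cars$speed, replace = TRUE)))

# 5th, 95th, 10th, and 90th percentiles of the simulated means
pctl <- quantile(sim_means, probs = c(0.05, 0.95, 0.10, 0.90))
round(pctl, 2)
```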

Case 1: A Two-sided Alternative


Critical Values
P(-1.677 < X < 1.677)     = 0.9
1 - P(-1.677 < X < 1.677) = 0.1

Result: The p-value
P(X < -2.14) = 0.019
P(X > 2.14) = 0.019
P(-2.14 < X < 2.14)     = 0.963
1 - P(-2.14 < X < 2.14) = 0.037

Knowing only the sample size, 50, and the confidence level, 0.9, is sufficient to determine what \(t\) must be to reject a mean of 17 in favor of the alternative that the true mean is not 17. A two-sided alternative splits the remaining 0.1 probability equally between two tails: 0.05 below -1.677 and 0.05 above 1.677, so that the interior range represents 0.9 probability as required. The observed \(t\) of -2.14 falls outside that range and the p-value, 0.037, is below 0.1, so we reject a mean of 17.
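The two-sided critical values and p-value can be checked with base R's qt and pt. This is a sketch; the object names are mine.

```r
x      <- cars$speed
dof    <- length(x) - 1                          # 49 degrees of freedom
t_stat <- (mean(x) - 17) / (sd(x) / sqrt(length(x)))

crit   <- qt(c(0.05, 0.95), dof)                 # 0.05 probability in each tail
p_two  <- 2 * pt(-abs(t_stat), dof)              # both tails beyond |t|
reject <- p_two < 0.10                           # equivalent to abs(t_stat) > crit[2]

round(crit, 3)
round(p_two, 3)
reject
```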

Case 2: A Lesser Alternative


Critical Values
P(X < -1.299) = 0.1
P(X > -1.299) = 0.9

Result: The p-value
P(X < -2.14) = 0.019
P(X > -2.14) = 0.981

Knowing only the sample size, 50, and the confidence level, 0.9, is sufficient to determine what \(t\) must be to reject a mean of 17 or greater in favor of the alternative that the true mean is less than 17. A one-sided alternative puts all of the remaining 0.1 probability in one tail, here below -1.299, with 0.9 above it. The observed \(t\) of -2.14 is below -1.299 and the p-value is 0.019, so we reject a mean of 17 or greater.
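The same check for the lesser alternative uses only the lower tail. Again a sketch in base R with illustrative names.

```r
x      <- cars$speed
dof    <- length(x) - 1
t_stat <- (mean(x) - 17) / (sd(x) / sqrt(length(x)))

crit_less   <- qt(0.10, dof)          # 0.1 probability lies below this value
p_less      <- pt(t_stat, dof)        # P(X < t): lower-tail p-value
reject_less <- t_stat < crit_less     # equivalent to p_less < 0.10

round(crit_less, 3)
round(p_less, 3)
reject_less
```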

Case 3: A Greater Alternative


Critical Values
P(X > 1.299) = 0.1
P(X < 1.299) = 0.9

Result: The p-value
P(X < -2.14) = 0.019
P(X > -2.14) = 0.981

Knowing only the sample size, 50, and the confidence level, 0.9, is sufficient to determine what \(t\) must be to reject a mean of 17 or less in favor of the alternative that the true mean is greater than 17. Here all of the remaining 0.1 probability sits in the upper tail, above 1.299, with 0.9 below it. The observed \(t\) of -2.14 is nowhere near that tail; the p-value, P(X > -2.14), is 0.981, so we cannot reject a mean of 17 or less.
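For the greater alternative, the rejection region is the upper tail, so the check uses the upper quantile and the upper-tail probability. A sketch in base R with illustrative names:

```r
x      <- cars$speed
dof    <- length(x) - 1
t_stat <- (mean(x) - 17) / (sd(x) / sqrt(length(x)))

crit_greater   <- qt(0.90, dof)                       # 0.1 probability lies above this value
p_greater      <- pt(t_stat, dof, lower.tail = FALSE) # P(X > t): upper-tail p-value
reject_greater <- t_stat > crit_greater               # equivalent to p_greater < 0.10

round(crit_greater, 3)
round(p_greater, 3)
reject_greater
```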

t.test

Alternative: Two-sided


    One Sample t-test

data:  cars$speed
t = -2.1397, df = 49, p-value = 0.03739
alternative hypothesis: true mean is not equal to 17
90 percent confidence interval:
 14.1463 16.6537
sample estimates:
mean of x 
     15.4 

Alternative: Less


    One Sample t-test

data:  cars$speed
t = -2.1397, df = 49, p-value = 0.01869
alternative hypothesis: true mean is less than 17
90 percent confidence interval:
     -Inf 16.37143
sample estimates:
mean of x 
     15.4 

Alternative: Greater


    One Sample t-test

data:  cars$speed
t = -2.1397, df = 49, p-value = 0.9813
alternative hypothesis: true mean is greater than 17
90 percent confidence interval:
 14.42857      Inf
sample estimates:
mean of x 
     15.4 
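The calls that produced these three outputs are not shown. Output of this form is what base R's t.test returns for calls like the following, where only the alternative argument changes; the object names are mine.

```r
# One-sample t-tests of speed against a hypothesized mean of 17 at 90% confidence
two_sided <- t.test(cars$speed, mu = 17, alternative = "two.sided", conf.level = 0.9)
less      <- t.test(cars$speed, mu = 17, alternative = "less",      conf.level = 0.9)
greater   <- t.test(cars$speed, mu = 17, alternative = "greater",   conf.level = 0.9)

two_sided
less
greater
```

Each result is a list, so individual pieces such as two_sided$p.value or less$conf.int can be pulled out for further use.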

radiant

Alternative: Two-sided

Single mean test
Data      : cars 
Variable  : speed 
Confidence: 0.9 
Null hyp. : the mean of speed = 17 
Alt. hyp. : the mean of speed is not equal to 17 

   mean  n n_missing    sd    se    me
 15.400 50         0 5.288 0.748 1.254

 diff    se t.value p.value df     5%    95%  
 -1.6 0.748   -2.14   0.037 49 14.146 16.654 *

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Alternative: Less

Single mean test
Data      : cars 
Variable  : speed 
Confidence: 0.9 
Null hyp. : the mean of speed = 17 
Alt. hyp. : the mean of speed is < 17 

   mean  n n_missing    sd    se    me
 15.400 50         0 5.288 0.748 1.254

 diff    se t.value p.value df   0%    90%  
 -1.6 0.748   -2.14   0.019 49 -Inf 16.371 *

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Alternative: Greater

Single mean test
Data      : cars 
Variable  : speed 
Confidence: 0.9 
Null hyp. : the mean of speed = 17 
Alt. hyp. : the mean of speed is > 17 

   mean  n n_missing    sd    se    me
 15.400 50         0 5.288 0.748 1.254

 diff    se t.value p.value df    10% 100%  
 -1.6 0.748   -2.14   0.981 49 14.429  Inf  

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Confidence Interval: Analytics


Speed from cars: The Distribution of the Average

Mean   SD     SE     N    Percentile of Simulated Means
                          P05    P95    P10    P90
15.4   5.288  0.748  50   14.18  16.58  14.48  16.32

\[\mu = \overline{x} - t_{49}*\left(\frac{s}{\sqrt{n}}\right)\]

Analytically, if 90% of \(t\) is between -1.677 and 1.677, then the central 90% of the distribution of averages given the data should range from

\[\mu = 15.4 - (-1.677,1.677)*\left(\frac{5.288}{\sqrt{50}}\right)\] which simplifies to: 14.15 to 16.65.

In the resampled averages, this is 14.18 to 16.58.

90% of \(t\) is smaller than 1.299, so \(\mu\) should be greater than

\[15.4 - 1.299*0.7478 = 14.43.\] In the resampled averages, 14.48 is the 10th percentile.

90% of \(t\) is bigger than -1.299, so \(\mu\) should be smaller than

\[15.4 + 1.299*0.7478 = 16.37.\] In the resampled averages, the 90th percentile is 16.32.
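These analytic bounds can be computed by plugging quantiles of \(t\) with 49 degrees of freedom into the expression for \(\mu\) obtained earlier by solving the \(t\) equation, \(\mu = \overline{x} - t\left(\frac{s}{\sqrt{n}}\right)\). A sketch in base R; the object names are mine.

```r
x   <- cars$speed
se  <- sd(x) / sqrt(length(x))   # standard error of the mean
dof <- length(x) - 1             # 49 degrees of freedom

# Two-sided: the upper t quantile gives the lower bound and vice versa
ci_two   <- mean(x) - qt(c(0.95, 0.05), dof) * se

# One-sided bounds, 0.1 probability in a single tail
lower_90 <- mean(x) - qt(0.90, dof) * se   # mu is greater than this
upper_90 <- mean(x) - qt(0.10, dof) * se   # mu is smaller than this

round(ci_two, 2)
round(c(lower = lower_90, upper = upper_90), 2)
```

The results agree with the intervals that t.test reports for the corresponding alternatives.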